On Using Class-Labels in Evaluation of Clusterings

Authors

  • Ines Färber
  • Stephan Günnemann
  • Hans-Peter Kriegel
  • Peer Kröger
  • Emmanuel Müller
  • Erich Schubert
  • Thomas Seidl
  • Arthur Zimek
Abstract

Although clustering has been studied for several decades, the fundamental problem of a valid evaluation has not yet been solved. The sound evaluation of clustering results, in particular on real data, is inherently difficult. In the literature, new clustering algorithms and their results are often externally evaluated with respect to an existing class labeling. These class-labels, however, may not be adequate for the structure of the data or the evaluated cluster model. Here, we survey the literature of different related research areas that have observed this problem. We discuss common “defects” that clustering algorithms exhibit w.r.t. this evaluation, and show them on several real-world data sets of different domains, along with a discussion of why the detected clusters do not indicate a bad performance of the algorithm but are valid and useful results. A useful alternative evaluation method requires more extensive data labeling than the commonly used class labels, or it needs a combination of information measures to take subgroups, supergroups, and overlapping sets of traditional classes into account. Finally, we discuss an evaluation scenario that considers the possible existence of several complementary sets of labels, and hope to stimulate the discussion among different sub-communities (such as ensemble-clustering, subspace-clustering, multi-label classification, hierarchical classification or hierarchical clustering, and multiview-clustering or alternative clustering) regarding requirements on enhanced evaluation methods.
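The effect described in the abstract can be illustrated with a small sketch. The code below is not taken from the paper: the toy data, the function name `pair_counting_scores`, and the choice of pair-counting precision, recall, and F1 as the external measure are all assumptions made for illustration. It shows how a clustering that finds subgroups of a class loses recall, and one that finds a supergroup loses precision, even though either structure may be valid for the data.

```python
from itertools import combinations


def pair_counting_scores(labels, clusters):
    """Pair-counting precision, recall and F1 of a clustering
    evaluated externally against a single class labeling."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_class = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_cluster and same_class:
            tp += 1          # pair grouped together in both
        elif same_cluster:
            fp += 1          # clustered together, different classes
        elif same_class:
            fn += 1          # same class, but split by the clustering
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: two classes with six objects each.
labels = ["A"] * 6 + ["B"] * 6

# Clustering 1: class "A" is split into two subgroups.
subgroups = [0, 0, 0, 1, 1, 1] + [2] * 6

# Clustering 2: both classes are merged into one supergroup.
supergroup = [0] * 12

for name, clustering in [("subgroups", subgroups), ("supergroup", supergroup)]:
    p, r, f = pair_counting_scores(labels, clustering)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

In this sketch both clusterings are scored as imperfect against the single labeling, which is the situation the abstract argues should not automatically be read as poor algorithm performance.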

Similar resources

Weighted Ensemble Clustering for Increasing the Accuracy of the Final Clustering

Clustering algorithms are highly dependent on different factors such as the number of clusters, the specific clustering algorithm, and the distance measure used. Inspired by ensemble classification, one approach to reduce the effect of these factors on the final clustering is ensemble clustering. Since weighting the base classifiers has been a successful idea in ensemble classification, in th...

Exploiting Associations between Class Labels in Multi-label Classification

Multi-label classification has many applications in text categorization, biology, and medical diagnosis, in which multiple class labels can be assigned to each training instance simultaneously. As it is often the case that there are relationships between the labels, extracting the existing relationships between the labels and taking advantage of them during the training or prediction phases ...

An Information-Theoretic External Cluster-Validity Measure

This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside...

Evaluation and Comparison of Sodium in High Consumption Foods with the Amount Reported on Nutritional Label in Kermanshah

Background & objectives: Proper food labeling can help improve a diet and health. Due to the importance of sodium in the diet, determining the amount of sodium in food and comparing it with the amount reported on nutrition labels was the objective of the present study. Methods: In this study, 96 high-consumption foods were examined in 5 groups, including meat and protein products, dairy produc...

Evaluating Clusterings by Estimating Clarity

In this thesis I examine clustering evaluation, with a subfocus on text clusterings specifically. The principal work of this thesis is the development, analysis, and testing of a new internal clustering quality measure called informativeness. I begin by reviewing clustering in general. I then review current clustering quality measures, accompanying this with an in-depth discussion of many of th...

Publication date: 2010